4th WEEK: Clustering and classification

Cluster analysis is one of the main tasks of exploratory data mining and is thus the topic of this week’s exercise. Clustering techniques identify groups (clusters) of observations so that members within a cluster are more similar to each other than to observations in other clusters. Defining what “similar” means, however, often requires a lot of contextual knowledge and creativity.

Introduction

The Boston data will be used. It can be loaded from the R package MASS. According to ?Boston, the data frame has 506 rows (observations) and 14 columns (variables). Briefly, the data report several variables potentially explaining housing values around Boston. Our aim is to classify the suburbs included in the Boston data set into classes based on their characteristics. The variables are:

  • crim: per capita crime rate by town

  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.

  • indus: proportion of non-retail business acres per town

  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

  • nox: nitrogen oxides concentration (parts per 10 million)

  • rm: average number of rooms per dwelling

  • age: proportion of owner-occupied units built prior to 1940

  • dis: weighted mean of distances to five Boston employment centres

  • rad: index of accessibility to radial highways

  • tax: full-value property-tax rate per $10,000

  • ptratio: pupil-teacher ratio by town

  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

  • lstat: lower status of the population (percent)

  • medv: median value of owner-occupied homes in $1000s

Firstly, the data are loaded and glimpsed, and thereafter summaries are printed.

# load the required packages and the data
library(MASS)      # Boston data, lda()
library(dplyr)     # glimpse() and data manipulation
library(tidyr)     # gather()
library(purrr)     # keep()
library(ggplot2)   # plots
library(tableone)  # CreateTableOne()
library(corrplot)  # correlation plots
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

Summaries and plots

Basic variable characteristics

The descriptive summary includes values of skewness (a measure of the asymmetry of a distribution) and kurtosis (a measure of the “tail-heaviness” of a distribution) and already shows the quantile values of the crime rate that will be used for categorization later.
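For reference, these two statistics can be sketched directly from their moment definitions (a minimal sketch; the summary function below may use a slightly different estimator, so the values can deviate marginally):

```r
# Moment-based skewness and excess kurtosis (both zero for a normal distribution).
skew <- function(x) {
  m <- mean(x); s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}
kurt <- function(x) {
  m <- mean(x); s <- sqrt(mean((x - m)^2))
  mean((x - m)^4) / s^4 - 3
}
skew(Boston$crim)  # strongly positive: heavy right tail
kurt(Boston$crim)  # far above zero: very heavy-tailed
```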

tab1 <- CreateTableOne(vars = c("crim", "zn", "indus", "chas", "nox", "rm",
                                "age", "dis", "rad", "tax", "ptratio", "black",
                                "lstat", "medv"),
                       factorVars = c("rad", "chas"), data = Boston)
summary(tab1)
## 
##      ### Summary of continuous variables ###
## 
## strata: Overall
##           n miss p.miss  mean    sd median   p25   p75   min   max skew
## crim    506    0      0   3.6   8.6    0.3 8e-02   3.7 6e-03  89.0  5.2
## zn      506    0      0  11.4  23.3    0.0 0e+00  12.5 0e+00 100.0  2.2
## indus   506    0      0  11.1   6.9    9.7 5e+00  18.1 5e-01  27.7  0.3
## nox     506    0      0   0.6   0.1    0.5 4e-01   0.6 4e-01   0.9  0.7
## rm      506    0      0   6.3   0.7    6.2 6e+00   6.6 4e+00   8.8  0.4
## age     506    0      0  68.6  28.1   77.5 5e+01  94.1 3e+00 100.0 -0.6
## dis     506    0      0   3.8   2.1    3.2 2e+00   5.2 1e+00  12.1  1.0
## tax     506    0      0 408.2 168.5  330.0 3e+02 666.0 2e+02 711.0  0.7
## ptratio 506    0      0  18.5   2.2   19.1 2e+01  20.2 1e+01  22.0 -0.8
## black   506    0      0 356.7  91.3  391.4 4e+02 396.2 3e-01 396.9 -2.9
## lstat   506    0      0  12.7   7.1   11.4 7e+00  17.0 2e+00  38.0  0.9
## medv    506    0      0  22.5   9.2   21.2 2e+01  25.0 5e+00  50.0  1.1
##          kurt
## crim    37.13
## zn       4.03
## indus   -1.23
## nox     -0.06
## rm       1.89
## age     -0.97
## dis      0.49
## tax     -1.14
## ptratio -0.29
## black    7.23
## lstat    0.49
## medv     1.50
## 
## =======================================================================================
## 
##      ### Summary of categorical variables ### 
## 
## strata: Overall
##   var   n miss p.miss level freq percent cum.percent
##  chas 506    0    0.0     0  471    93.1        93.1
##                           1   35     6.9       100.0
##                                                     
##   rad 506    0    0.0     1   20     4.0         4.0
##                           2   24     4.7         8.7
##                           3   38     7.5        16.2
##                           4  110    21.7        37.9
##                           5  115    22.7        60.7
##                           6   26     5.1        65.8
##                           7   17     3.4        69.2
##                           8   24     4.7        73.9
##                          24  132    26.1       100.0
## 

Distribution plots

To get a better idea of the variables and their distributions some plots are generated.

#density plots for numerical variables
Boston %>%
  keep(is.numeric) %>%                     # keep only numeric columns
  gather() %>%                             # convert to key-value pairs
  ggplot(aes(value)) +                     # plot the values
    facet_wrap(~ key, scales = "free") +   # in separate panels
    geom_density()                         # as density

#histograms for integer variables
Boston %>%
  keep(is.integer) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram(bins=10)

Due to the variable characteristics (percentages, proportions or function-based values) it is understandable that most of them have rather uneven, skewed distributions: e.g. black (scaled proportion of black residents), indus (proportion of non-retail business acres), age (proportion of owner-occupied units built prior to 1940), zn (proportion of land zoned for very large lots) and lstat (lower status of the population, percent). In contrast, the average number of rooms per dwelling (rm) is normally distributed, and the median value of owner-occupied homes (medv) can also be judged to be. The Charles River dummy variable (chas) is binary (1/0), indicating whether the tract bounds the river, and the radial highway accessibility index (rad) is an interval-scaled index.

Crime rate per capita

#to get a better idea about variable crim it is plotted separately
plot(Boston$crim, col="red",pch=8, main="Crime rate per capita")
text(198,59," Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.01    0.08    0.26    3.61    3.68   88.98" )

ggplot(Boston, aes(x = crim)) +
  stat_density(aes(col="red"),position="identity",geom="line",size=2)+
  ggtitle("Crime rate per capita")+ theme(legend.position="none")

We are especially interested in the variable crim, so it is visualized separately. The crime rate varies a lot between areas, ranging from 0.01 to 88.98. As can be seen in the plot, quite a few high outlier values among the 506 observations pull the mean (3.61) well above the median (0.26). The distribution is strongly right-skewed.

Variable correlations

To explore the relations between the variables of the data set, pairwise scatter plots and a correlation plot are printed.

#scatterplots
pairs(Boston,lower.panel = NULL)

To me these scatter plots are not that informative, though. Thus, I will try another approach, where the correlation chart presents simultaneously the direction (colour) and the magnitude (size of the circle) of each correlation as well as its value.

#a more visual correlation matrix
cor_matrix<-cor(Boston) %>% round(2)
corrplot.mixed(cor_matrix,number.cex=0.65,tl.cex=0.6)

And, finally, I create a conservative table of correlations with significance-level annotations, using the helper function below.

library(xtable)  
corstars <-function(x, method=c("pearson", "spearman"), removeTriangle=c("upper", "lower"),
                     result=c("none", "html", "latex")){
    #Compute correlation matrix
    require(Hmisc)
    x <- as.matrix(x)
    correlation_matrix<-rcorr(x, type=method[1])
    R <- correlation_matrix$r # Matrix of correlation coefficients
    p <- correlation_matrix$P # Matrix of p-values
    
    ## Define notation for significance levels; spacing is important.
    mystars <- ifelse(p < .0001, "****", ifelse(p < .001, "*** ", ifelse(p < .01, "**  ", ifelse(p < .05, "*   ", "    "))))
    
    ## truncate the correlation matrix to two decimals
    R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]
    
    ## build a new matrix that includes the correlations with their appropriate stars
    Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
    diag(Rnew) <- paste(diag(R), " ", sep="")
    rownames(Rnew) <- colnames(x)
    colnames(Rnew) <- paste(colnames(x), "", sep="")
    
    ## remove upper triangle of correlation matrix
    if(removeTriangle[1]=="upper"){
      Rnew <- as.matrix(Rnew)
      Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
      Rnew <- as.data.frame(Rnew)
    }
    
    ## remove lower triangle of correlation matrix
    else if(removeTriangle[1]=="lower"){
      Rnew <- as.matrix(Rnew)
      Rnew[lower.tri(Rnew, diag = TRUE)] <- ""
      Rnew <- as.data.frame(Rnew)
    }
    
    ## drop the last (empty) column and return the correlation matrix
    Rnew <- Rnew[, 1:(ncol(Rnew) - 1), drop = FALSE]
    if (result[1]=="none") return(Rnew)
    else{
      if(result[1]=="html") print(xtable(Rnew), type="html")
      else print(xtable(Rnew), type="latex") 
    }
} 
# Make table with centered columns
corstars(Boston,result="html")
crim zn indus chas nox rm age dis rad tax ptratio black lstat
crim
zn -0.20****
indus 0.41**** -0.53****
chas -0.06 -0.04 0.06
nox 0.42**** -0.52**** 0.76**** 0.09*
rm -0.22**** 0.31**** -0.39**** 0.09* -0.30****
age 0.35**** -0.57**** 0.64**** 0.09 0.73**** -0.24****
dis -0.38**** 0.66**** -0.71**** -0.10* -0.77**** 0.21**** -0.75****
rad 0.63**** -0.31**** 0.60**** -0.01 0.61**** -0.21**** 0.46**** -0.49****
tax 0.58**** -0.31**** 0.72**** -0.04 0.67**** -0.29**** 0.51**** -0.53**** 0.91****
ptratio 0.29**** -0.39**** 0.38**** -0.12** 0.19**** -0.36**** 0.26**** -0.23**** 0.46**** 0.46****
black -0.39**** 0.18**** -0.36**** 0.05 -0.38**** 0.13** -0.27**** 0.29**** -0.44**** -0.44**** -0.18****
lstat 0.46**** -0.41**** 0.60**** -0.05 0.59**** -0.61**** 0.60**** -0.50**** 0.49**** 0.54**** 0.37**** -0.37****
medv -0.39**** 0.36**** -0.48**** 0.18**** -0.43**** 0.70**** -0.38**** 0.25**** -0.38**** -0.47**** -0.51**** 0.33**** -0.74****

 

From all the information above it can be seen that there are several relevant correlations between the variables. Strong negative correlations exist between the weighted mean distance to five Boston employment centres (dis) and the proportion of non-retail business acres per town (indus), nitrogen oxide concentration (nox), and the proportion of older properties (age). Not surprisingly, a strong negative correlation between the lower status of the population (lstat) and the median home value (medv) is seen. Strong positive correlations exist especially between the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax), and between rad and indus. Indus is further positively correlated with nox and tax. Furthermore, one of our main interests, the crime rate, is correlated with many of the variables: negatively with e.g. the distance to employment centres (dis) and the median home value (medv), and positively with e.g. the property-tax rate (tax) and access to radial highways (rad). Thus, an increase in the crime rate seems to be associated with an increasing highway accessibility index and property tax.

Scaling the dataset and categorising crime rate

Linear discriminant analysis is a method that generates linear combinations of the variables to characterize the target classes. Because the method assumes normally distributed variables with a common covariance structure, the data set is first standardized: the column mean is subtracted from each column and the difference is divided by the column’s standard deviation, so that every variable has mean zero and standard deviation one:

\[ scaled(x) = \frac{x - \text{mean}(x)}{\text{sd}(x)} \]

#scale the dataset
boston_scaled <- as.data.frame(scale(Boston))
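As a quick sanity check (a minimal sketch, assuming scale() standardizes each column with its mean and sample standard deviation):

```r
# The scaled crim column should equal (x - mean(x)) / sd(x) computed by hand.
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual_crim, boston_scaled$crim)  # TRUE
# After scaling, every column mean is (numerically) zero and every sd is one.
summary(colMeans(boston_scaled))
summary(apply(boston_scaled, 2, sd))
```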

In comparison to the earlier summary, the values have changed: after scaling, all mean values are zero and all standard deviations are one.

In addition to scaling, a categorical variable of the scaled crime rate has to be created. Quantiles are used as the break points, yielding four categories, low, medium low, medium high and high crime rate, each with an approximately equal number of observations.

Next, the data set is randomly split into a train (80%) and a test (20%) set for the analysis. Thus, the train set has 404 and the test set 102 observations.

# create a quantile vector of crim, and use it to categorize crim
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c('low','med_low','med_high','high'))
# replace the original unscaled variable.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
# explore the categorised variable
table(boston_scaled$crime)
## 
##      low  med_low med_high     high 
##      127      126      126      127

To compare how the original values by the created groups differ between the groups a summary is created stratified by the different crime rate categories. There are many large differences, especially when comparing the high and low crime rate groups.

Boston2<-Boston
Boston2$crime <- boston_scaled$crime
CreateTableOne(vars=c("zn", "indus","chas","nox" ,"rm"     
 ,"age","dis" ,"rad" , "tax" ,"ptratio", "black"  
,"lstat" , "medv"), strata=c("crime"), test=FALSE, data=Boston2)
##                      Stratified by crime
##                       low            med_low         med_high      
##   n                      127            126             126        
##   zn (mean (sd))       33.85 (34.67)   9.11 (14.54)    2.41 (6.59) 
##   indus (mean (sd))     4.88 (4.28)    9.14 (5.80)    12.41 (6.57) 
##   chas (mean (sd))      0.04 (0.20)    0.06 (0.24)     0.12 (0.33) 
##   nox (mean (sd))       0.45 (0.05)    0.49 (0.06)     0.60 (0.11) 
##   rm (mean (sd))        6.60 (0.56)    6.19 (0.47)     6.36 (0.88) 
##   age (mean (sd))      43.65 (19.59)  59.03 (29.05)   80.18 (21.91)
##   dis (mean (sd))       5.64 (2.08)    4.54 (1.93)     3.00 (1.25) 
##   rad (mean (sd))       3.54 (1.59)    4.78 (1.50)     5.96 (4.28) 
##   tax (mean (sd))     283.83 (63.32) 327.83 (102.02) 356.32 (91.50)
##   ptratio (mean (sd))  17.50 (1.89)   18.32 (1.64)    17.84 (2.86) 
##   black (mean (sd))   391.25 (9.19)  386.14 (30.87)  364.64 (56.91)
##   lstat (mean (sd))     7.15 (2.79)   11.71 (5.53)    12.74 (6.91) 
##   medv (mean (sd))     27.37 (8.00)   22.51 (5.38)    24.10 (10.28)
##                      Stratified by crime
##                       high           
##   n                      127         
##   zn (mean (sd))        0.00 (0.00)  
##   indus (mean (sd))    18.11 (0.13)  
##   chas (mean (sd))      0.06 (0.23)  
##   nox (mean (sd))       0.68 (0.06)  
##   rm (mean (sd))        6.00 (0.69)  
##   age (mean (sd))      91.45 (9.95)  
##   dis (mean (sd))       2.00 (0.55)  
##   rad (mean (sd))      23.85 (1.69)  
##   tax (mean (sd))     663.93 (23.34) 
##   ptratio (mean (sd))  20.16 (0.49)  
##   black (mean (sd))   284.96 (147.79)
##   lstat (mean (sd))    19.00 (6.85)  
##   medv (mean (sd))     16.16 (8.63)

Fitting the model

LDA analysis on the train set

A linear discriminant analysis (LDA) model is fitted to classify the suburbs, using the categorized crime rate as the target variable. Firstly, the model is fitted on the training data, and thereafter the classes are predicted on the test data.

set.seed(123)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

LDA (bi)plots

Based on the analysis, the results are plotted first plain and thereafter as an LDA (bi)plot, with the help of a specifically generated “arrow” function to add arrows. It has to be kept in mind that for plotting the classes have to be transformed from categorical to numeric.

#helper function for the biplot arrows.
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.2, color = "deeppink", tex = 1, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
#crime classes to numeric for plotting
classes <- as.numeric(train$crime) 
#plotting the lda
plot(lda.fit, dimen = 2, col = classes, pch = classes)

#(bi)plot
plot(lda.fit, dimen = 2, col = classes, pch = classes)
#arrows 
lda.arrows(lda.fit) 

print(lda.fit) 
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2475248 0.2549505 0.2475248 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       0.9432005 -0.8959713 -0.11640431 -0.8577158  0.44620905
## med_low  -0.1512978 -0.3010310 -0.07547406 -0.5591764 -0.11907536
## med_high -0.4023185  0.2609362  0.18636222  0.4392319  0.04779161
## high     -0.4872402  1.0149946 -0.07547406  1.0412268 -0.50541663
##                 age        dis        rad        tax     ptratio
## low      -0.8546026  0.8232513 -0.6953230 -0.7314517 -0.45097009
## med_low  -0.3467956  0.2952511 -0.5420083 -0.4653999 -0.03350366
## med_high  0.4318874 -0.4060664 -0.4087527 -0.2821208 -0.32611400
## high      0.7952399 -0.8387924  1.6596029  1.5294129  0.80577843
##                black       lstat        medv
## low       0.37835870 -0.76633833  0.51252310
## med_low   0.32163001 -0.18506538  0.01078532
## med_high  0.08166164  0.03210543  0.15583308
## high     -0.74993410  0.87136020 -0.65964311
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08288224  0.74545012 -1.02758413
## indus    0.03371880 -0.27414286  0.12800485
## chas    -0.07688231 -0.04569133  0.03313223
## nox      0.24620783 -0.70692830 -1.31233352
## rm      -0.13308714 -0.06060304 -0.15792484
## age      0.29339946 -0.28245910 -0.17691125
## dis     -0.04642187 -0.28428720  0.07074842
## rad      3.40029809  0.91155778 -0.10269503
## tax      0.04070331 -0.03257583  0.61283757
## ptratio  0.11422801  0.09094945 -0.21802056
## black   -0.15410658  0.03229362  0.15457698
## lstat    0.19460392 -0.28843142  0.29476658
## medv     0.17982251 -0.40689877 -0.20349106
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9510 0.0368 0.0122

Prediction

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
##           predicted
## correct    low med_low med_high high Sum
##   low       16      10        0    0  26
##   med_low    7      15        4    0  26
##   med_high   1       6       15    1  23
##   high       0       0        1   26  27
##   Sum       24      31       20   27 102

Cross-tabulating the correct and predicted classes shows that the model’s predictions are quite accurate for the high category in the test data. In contrast, the low areas are not recognized that well, and both the medium low and medium high classes seem to be problematic. Reflecting the results against the plot, the separation is clearest for the highest class, which makes the prediction accuracies understandable.
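The same conclusion can be quantified directly from the cross tabulation: with this split, about 71% (72/102) of the test observations are classified correctly overall.

```r
# Overall accuracy: share of correctly classified test observations.
mean(correct_classes == lda.pred$class)
# Per-class accuracy: diagonal of the row-wise proportion table.
diag(prop.table(table(correct_classes, lda.pred$class), margin = 1))
```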

K-means

In comparison to LDA, which is a supervised classification method, K-means is an unsupervised clustering method: it assigns the observations to k clusters so that the variation within each cluster is as small as possible.
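As a sketch of the idea (kmeans() itself defaults to the more refined Hartigan-Wong algorithm; this minimal Lloyd-style version ignores details such as empty clusters and multiple random starts):

```r
# Lloyd's algorithm: alternate between assigning each point to its nearest
# centroid and recomputing the centroids, until the assignments stabilize.
lloyd <- function(X, k, iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  cluster <- integer(nrow(X))
  for (i in seq_len(iter)) {
    # squared Euclidean distance of every point to every centroid
    d <- sapply(seq_len(k), function(j) colSums((t(X) - centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster
    # recompute centroids (assumes no cluster becomes empty)
    centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster = cluster, centers = centers)
}
```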

Euclidean and Manhattan distances

For K-means clustering, the Boston dataset is rescaled so that the distances between variables are comparable. To examine the distance properties of the data and to compare the methods superficially, summaries of both the Euclidean (geometric) and the Manhattan (along the axes) distances are printed.

data(Boston)
#center and standardize variables and make it a data frame
boston_scaled<-as.data.frame(scale(Boston))
#Euclidean distance matrix
dist_eu<-dist(boston_scaled)
#for comparison Manhattan distance matrix
dist_man<-dist(boston_scaled,method = "manhattan" )
#summaries
summary(dist_eu) # Euclidean
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
summary(dist_man) # Manhattan
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
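For intuition, both distances can be computed by hand for the first pair of observations; the results should match the corresponding elements of the distance matrices (as.matrix(dist_eu)[1, 2] and as.matrix(dist_man)[1, 2]):

```r
x1 <- as.numeric(boston_scaled[1, ])
x2 <- as.numeric(boston_scaled[2, ])
sqrt(sum((x1 - x2)^2))  # Euclidean: square root of the summed squared differences
sum(abs(x1 - x2))       # Manhattan: sum of the absolute differences
```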

Preliminary K-means and determination of the optimal number of clusters

The K-means algorithm is first run exploratorily on the dataset using 5 clusters.

#kmeans using euclidean and five clusters
km <- kmeans(dist_eu, centers = 5)
pairs(boston_scaled, col = km$cluster,lower.panel = NULL)

At first sight the plotted results look a little like fireworks. Before interpreting the plot in more detail, the optimal number of clusters is estimated using the total within-cluster sum of squares (TWCSS), with the number of clusters ranging from 1 to 10, and the results are visualized.

set.seed(123) # for reproducible random starts
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
#Plot the results
plot(1:k_max, twcss, type = 'b')

The most prominent drop in the total within-cluster sum of squares happens at 2 clusters. There are smaller drops after two as well, but they yield only minor improvements. Thus, I choose to carry out the K-means clustering again using 2 as the number of clusters.

K- means using the optimal number of clusters

km <- kmeans(dist_eu, centers = 2)
pairs(boston_scaled, col = km$cluster, lower.panel = NULL)

The new pairwise scatter plot with only two clusters looks better than the previous one. All data points are assigned to one of the two (red/black) clusters. The clearer the colour separation, the more relevant the variable is for the clustering. Property tax and access to radial highways seem to discriminate quite well between the two clusters.

Bonus

K-means is performed on the scaled Boston data using 4 clusters. Thereafter, LDA is fitted using the generated clusters as the target classes, and a biplot is printed.

set.seed(123)
data(Boston)
boston_scaled4 <- as.data.frame(scale(Boston,center=TRUE,scale = TRUE))
dist_eu4 <- dist(boston_scaled4)
km2 <-kmeans(dist_eu4, centers = 4)
#pairs(boston_scaled4, col = km$cluster, lower.panel = NULL)
boston_scaled4$classes<-km2$cluster
lda.fit2 <- lda(classes ~ ., data = boston_scaled4)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
plot(lda.fit2, dimen = 2, col = boston_scaled4$classes, pch = boston_scaled4$classes)
lda.arrows(lda.fit2, myscale = 2)

Super-Bonus

The given code is run on the scaled train set. A matrix including the projections of the data points is created.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
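As a small check (not part of the original exercise): predict() applied to the training data yields the same projections up to a constant shift per column, because predict.lda() centers the predictors before multiplying by the scaling matrix.

```r
lda_scores <- predict(lda.fit, newdata = train)$x
# Each LD column should differ from matrix_product by a constant only.
head(lda_scores - as.matrix(matrix_product))
```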

A 3D plot without colours is created:

library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

A 3D plot is created with the categorized crime rate classes defining the colours:

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)

A 3D plot is created with the K-means clusters from the bonus section defining the colours:

train$cluster<-km2$cluster[match(rownames(train),rownames(boston_scaled4))]
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=as.factor(train$cluster))